find in BeautifulSoup4

2023-11-13 22:31 | Source: compiled from the web

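Combining regular expressions with BeautifulSoup makes web scraping considerably less work.

Let's practice on the Baidu Tieba home page: https://tieba.baidu.com/index.html

1. find_all(): searches all descendants of the current node (children, grandchildren, and so on) and returns every tag that matches the filter.

The example below uses find_all() on the Tieba category module to match links whose href contains the two characters “娱乐” (entertainment).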

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import re

    f = urlopen('https://tieba.baidu.com/index.html').read()
    soup = BeautifulSoup(f, 'html.parser')
    # A compiled regular expression is used as the href filter
    for link in soup.find_all('a', href=re.compile('娱乐')):
        # Default to '' so a link without a title does not raise TypeError
        print(link.get('title', '') + ':' + link.get('href'))

Result:

    娱乐明星:/f/index/forumpark?pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    港台东南亚明星:/f/index/forumpark?cn=港台东南亚明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    内地明星:/f/index/forumpark?cn=内地明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    韩国明星:/f/index/forumpark?cn=韩国明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    日本明星:/f/index/forumpark?cn=日本明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    时尚人物:/f/index/forumpark?cn=时尚人物&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    欧美明星:/f/index/forumpark?cn=欧美明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    主持人:/f/index/forumpark?cn=主持人&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1
    其他娱乐明星:/f/index/forumpark?cn=其他娱乐明星&ci=0&pcn=娱乐明星&pci=0&ct=1&rn=20&pn=1

soup.find_all('a', href=re.compile('娱乐')) is equivalent to soup('a', href=re.compile('娱乐')). Calling the soup object directly is a shortcut for find_all(), so the example above could be written that way as well.
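The live page changes over time, so it can help to try the same filter offline. The following is a minimal sketch on a small hand-written HTML fragment; the markup, the variable name html, and the link values are assumptions made for illustration and are not copied from Tieba. It also checks that calling the soup object directly returns the same result as find_all().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Tieba category module.
html = """
<div>
  <a title="娱乐明星" href="/f/index/forumpark?pcn=娱乐明星&amp;pn=1">娱乐明星</a>
  <a title="爱综艺" href="/f/index/forumpark?pcn=爱综艺&amp;pn=1">爱综艺</a>
  <a title="内地明星" href="/f/index/forumpark?cn=内地明星&amp;pcn=娱乐明星&amp;pn=1">内地明星</a>
  <a name="top">no href here</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile('娱乐')
matched = soup.find_all('a', href=pattern)  # the regex is tested against each href value
shortcut = soup('a', href=pattern)          # calling the object is the same as find_all()

# Links without an href, or whose href lacks the pattern, are skipped.
titles = [a.get('title') for a in matched]
print(titles)
```

Because the regular expression is only tested against tags that actually carry an href, the anchor without one is left out of the result.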

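** If none of the built-in filters fits, you can define a function that takes a single argument and returns True for a match. When the function is used to filter on a particular attribute, the argument it receives is the value of that attribute, not the tag itself.

    import re

    def abc(href):
        # href is the attribute value; it is None for tags without an href
        return href and not re.compile('娱乐明星').search(href)

    # Every tag that has an href which does not contain "娱乐明星"
    print(soup.find_all(href=abc))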
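The difference between the two kinds of function filter can be sketched as follows. The HTML fragment and the function names not_celebrity and has_title_but_no_href are hypothetical and exist only for this illustration: a function passed as a keyword argument receives the attribute value, while a function passed as the first positional argument receives the whole tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<a title="娱乐明星" href="/f?pcn=娱乐明星">娱乐明星</a>
<a title="爱综艺" href="/f?pcn=爱综艺">爱综艺</a>
<span title="note">plain text</span>
"""
soup = BeautifulSoup(html, 'html.parser')

def not_celebrity(href):
    # Receives the href value, or None for tags that have no href.
    return href is not None and re.search('娱乐明星', href) is None

def has_title_but_no_href(tag):
    # Passed positionally, the function receives the whole Tag object.
    return tag.has_attr('title') and not tag.has_attr('href')

by_value = soup.find_all(href=not_celebrity)
by_tag = soup.find_all(has_title_but_no_href)

print([t['title'] for t in by_value])
print([t.name for t in by_tag])
```

Returning an explicit boolean, rather than relying on the truthiness of the href string, keeps the filter's intent easier to read.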

Parameters of find_all(): find_all(name, attrs, recursive, string, limit, **kwargs)


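find_all('a'): finds all <a> tags

find_all(title='爱综艺'): finds all tags whose title attribute is '爱综艺'

find(string=re.compile('贴吧')): finds the first string in the document that contains “贴吧”; the result is the string itself, not the tag around it

find_all(href=re.compile('娱乐明星'), title='娱乐明星'): several keyword arguments can be given at once to filter on multiple attributes of a tag

find_all(attrs={"title": "娱乐明星"}): the attrs dictionary can be used for attributes whose names cannot be passed as keyword arguments (for example data-* attributes)

find_all(href=re.compile('娱乐明星'), limit=3): the limit parameter caps the number of results returned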
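The parameters listed above can be tried together on a small fragment. This is a minimal sketch: the markup and the data-rank attribute are invented for the illustration, and recursive=False is included because it appears in the signature although none of the items above demonstrates it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; data-rank is an invented attribute.
html = """
<div id="nav">
  <a title="娱乐明星" href="/f?pcn=娱乐明星" data-rank="1">娱乐明星</a>
  <p><a title="爱综艺" href="/f?pcn=爱综艺" data-rank="2">爱综艺</a></p>
  <a title="贴吧首页" href="/index.html" data-rank="3">贴吧首页</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
nav = soup.find('div', id='nav')

all_links = nav.find_all('a')                      # every descendant <a>
direct_links = nav.find_all('a', recursive=False)  # direct children only
first_two = nav.find_all('a', limit=2)             # stop after two matches

# data-rank is not a valid Python keyword name, so it goes through attrs.
by_attrs = soup.find_all(attrs={'data-rank': '2'})

# Both conditions must hold for a tag to match.
both = soup.find_all(href=re.compile('娱乐明星'), title='娱乐明星')

# string= matches text nodes; the result is a NavigableString.
text = soup.find(string=re.compile('贴吧'))

print(len(all_links), len(direct_links), str(text))
```

To get from the matched string back to its tag, text.parent can be used.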

2. Finding tags with CSS selectors: call select() and loop over the matches you need.

** Find <a> tags on the page whose href starts with “/f/index”:

    for link2 in soup.select('a[href^="/f/index"]'):
        print(link2.get('title', '') + ':' + link2.get('href'))

** Find <a> tags on the page whose href ends with “&pn=1”:

    for link2 in soup.select('a[href$="&pn=1"]'):
        print(link2.get('title', '') + ':' + link2.get('href'))

** Find <a> tags on the page whose href contains “娱乐”:

    for link3 in soup.select('a[href*="娱乐"]'):
        print(link3.get('title', '') + ':' + link3.get('href'))
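The three attribute selectors can be compared side by side without fetching the page. The sketch below uses a hand-written fragment; the markup and the helper titles() are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; &amp; is decoded to & in the parsed href values.
html = """
<a title="娱乐明星" href="/f/index/forumpark?pcn=娱乐明星&amp;pn=1">x</a>
<a title="爱综艺" href="/f/index/forumpark?pcn=爱综艺&amp;pn=2">x</a>
<a title="首页" href="/index.html">x</a>
"""
soup = BeautifulSoup(html, 'html.parser')

def titles(selector):
    # Hypothetical helper: title of every tag matched by the selector.
    return [a.get('title', '') for a in soup.select(selector)]

starts = titles('a[href^="/f/index"]')  # href begins with the value
ends = titles('a[href$="&pn=1"]')       # href ends with the value
contains = titles('a[href*="娱乐"]')     # href contains the value

print(starts, ends, contains)
```

Unlike the regular-expression filter of find_all(), these selectors do plain substring matching, so no escaping of regex metacharacters is needed.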

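soup.select('meta'): select by tag name

soup.select('html meta link'): select by nested tag names, level by level (link inside meta inside html)

soup.select('meta > link:nth-of-type(3)'): selects the third link child directly under a meta tag

soup.select('div > #head'): selects the direct child of a div whose id is head

soup.select('div > a'): selects all <a> tags that are direct children of a div

soup.select("#searchtb ~ .authortb"): selects the sibling tags with class authortb that follow the tag with id searchtb

soup.select("[class~=m_pic]") and soup.select(".m_pic"): both select tags whose class includes m_pic

soup.select(".tag-name,.post_author"): several selectors separated by commas are queried at once; a tag matching any of them is returned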
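The combinators in the list above can be checked on a small fragment. This is a sketch under stated assumptions: the markup is invented, it reuses the ids and class names from the list only as labels, and texts() is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the ids and classes mentioned above.
html = """
<div id="wrap">
  <a class="m_pic big" href="/a">direct</a>
  <p><a class="tag-name" href="/b">nested</a></p>
  <span class="authortb">author 0</span>
  <span id="searchtb">search</span>
  <span class="authortb">author 1</span>
  <span class="post_author">poster</span>
  <span class="authortb">author 2</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

def texts(selector):
    # Hypothetical helper: text of every tag matched by the selector.
    return [tag.get_text() for tag in soup.select(selector)]

descendants = texts('div a')               # any depth below the div
children = texts('div > a')                # direct children only
siblings = texts('#searchtb ~ .authortb')  # only siblings after #searchtb
by_class = texts('.m_pic')                 # class list contains m_pic
by_class_attr = texts('[class~=m_pic]')    # same match, attribute syntax
grouped = texts('.tag-name,.post_author')  # either selector, document order

first = soup.select_one('div > a')         # first match or None

print(descendants, children, siblings, grouped)
```

When only one tag is wanted, select_one() avoids indexing into the list returned by select().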
